ABSTRACT

To accommodate the needs of large-scale distributed P2P systems, 
scalable data management strategies are required, allowing appli- 
cations to efficiently cope with continuously growing, highly dis- 
tributed data. This paper addresses the problem of efficiently stor- 
ing and accessing very large binary data objects (blobs). It proposes 
an efficient versioning scheme allowing a large number of clients 
to concurrently read, write and append data to huge blobs that are 
fragmented and distributed at a very large scale. Scalability un- 
der heavy concurrency is achieved thanks to an original metadata 
scheme, based on a distributed segment tree built on top of a Dis- 
tributed Hash Table (DHT). Our approach has been implemented 
and evaluated within our BlobSeer prototype on the Grid'5000
testbed, using up to 175 nodes. 

1. INTRODUCTION 

Peer-to-peer (P2P) systems have extensively been studied during 
the last years as a means to achieve very large scale scalability for 
services and applications. This scalability is generally obtained 
through software architectures based on autonomic peers which 
may take part in a collaborative work process in a dynamic way: 
they may join or leave at any time, publish resources or use re- 
sources made available by other peers. P2P environments typically 
need scalable data management schemes able to cope with a growing
number of clients and with continuously growing data (e.g.,
data streams), while supporting a dynamic and highly concurrent
environment. 

As the usage of the P2P approach extends to more and more appli- 
cation classes, the storage requirements for such a large scale are 
becoming increasingly complex due to the rate, scale and variety of 
data. In this context, storing, accessing and processing very large, 
unstructured data is of utmost importance. Unstructured data con- 
sists of free-form text such as word processing documents, e-mail, 
Web pages, text files, sources that contain natural language text, 
images, audio and video streams to name a few. 

Studies show that more than 80% [8] of the data in circulation globally
is unstructured. At the same time, data sizes are increasing dramatically:
for example, medical experiments [15] have an average requirement
of 1 TB per week. Large repositories for data analysis programs,
data streams generated and updated by continuously running
applications, and data archives are just a few examples of contexts
where unstructured data easily reaches the order of 1 TB.

Unstructured data are often stored as a binary large object (blob) 
within a database or a file. However, these approaches can hardly 
cope with blobs which grow to huge sizes. To address this issue, 
specialized abstractions like MapReduce [5] and Pig-Latin [14] 
propose high-level data processing frameworks intended to hide the 
details of parallelization from the user. Such platforms are imple- 
mented on top of huge object storage and target high performance 
by optimizing the parallel execution of the computation. This leads 
to heavy access concurrency on the blobs, hence the need for the storage
layer to support such concurrency. Parallel and distributed file
systems also consider using objects for low-level storage [6, 17, 7].
In other scenarios, huge blobs need to be used concurrently at the 
highest level layers of applications directly: high-energy physics 
applications, multimedia processing [4] or astronomy [13]. 

In this paper we address the problem of storing and efficiently ac- 
cessing very large unstructured data objects [11, 15] in a distributed 
environment. We focus on the case where data is mutable and po- 
tentially accessed by a very large number of concurrent, distributed 
processes, as it is typically the case in a P2P system. In this context, 
versioning is an important feature. Not only does it allow rolling back
data changes when desired, but it also enables cheap branching
(possibly recursively): the same computation may proceed inde- 
pendently on different versions of the blob. Versioning should ob- 
viously not significantly impact access performance to the object, 
given that objects are under constant heavy access concurrency. On 
the other hand, versioning leads to increased storage space usage 
and becomes a major concern when the data size itself is huge. 
Versioning efficiency thus refers to both access performance under 
heavy load and reasonably acceptable overhead of storage space. 

Related work has been carried out in the area of parallel and dis- 
tributed file systems [1, 3, 7] and archiving systems [18]: in all 
these systems the metadata management is centralized and mainly 
optimized for data reading and appending. In contrast, we rely 
on metadata decentralization, in order to introduce an efficient ver- 
sioning scheme for huge, large-scale distributed blobs that are con- 
currently accessed by an arbitrarily large number of clients which 
may read, write or append data to blobs. Our algorithm guaran- 
tees atomicity while still attaining good data access performance. 
Our approach splits a huge blob into small fixed-sized pages that 
are scattered across commodity data providers. Rather than updating
the current pages, completely new pages are generated when
clients request data modifications. The corresponding metadata is
"weaved" with old metadata in such a way as to offer a complete
virtual view of both the past versions and the current version of
the blob. Metadata is organized as a segment-tree like structure 
(see Section 4) and is also scattered across the system using a Dis- 
tributed Hash Table (DHT). Distributing data and metadata not only 
enables high performance through parallel, direct access I/O paths, 
but also favors efficient use of storage space: although a full vir- 
tual view of all past versions of the blob is offered, real space is 
consumed only by the newly generated pages. 

Our approach has been implemented and experimented within our 
prototype, called BlobSeer: a binary large object management ser- 
vice. In previous work [13, 12] we have handled versioning in a 
static way: blobs were considered huge storage objects of prede- 
fined, fixed sizes that are first allocated, then manipulated by read- 
ing and writing parts of them. However, in most real life scenarios, 
blobs need to dynamically grow, as new data is continuously gath- 
ered. This paper improves on our previous work as follows. First, 
we introduce support for dynamic blob expansion through atomic 
append operations. Second, we introduce cheap branching, allowing
a blob to evolve in multiple, completely different ways through
writes and appends starting from a particular snapshot version. This 
may be very useful for exploring alternative data processing algo- 
rithms starting from the same blob version. 

The paper is organized as follows. Section 2 restates the specifi- 
cation of the problem in a more formal way. Section 3 provides 
an overview of our design and precisely describes how data access 
operations are handled. The algorithms used for metadata management
are discussed in Section 4. Section 5 provides a few implementation
details and reports on the experimental evaluation performed
on a multi-site grid testbed. Ongoing and future work is
discussed in Section 6. 



2. SPECIFICATION 

Our goal is to enable efficient versioning of blobs in a highly concurrent
environment. In such a context, an arbitrarily large number n
of clients compete to read and update the blob. A blob grows as
clients append new data and its contents may be modified by partial 
or total overwriting. 

Each time the blob gets updated, a new snapshot reflecting the 
changes and labeled with an incremental version is generated, rather 
than overwriting any existing data. This allows access to all past 
versions of the blob. In its initial state, any blob is assumed to be
empty (its size is 0) and is labeled with version 0. (Note that
our previous work [13, 12] was relying on different assumptions: 
the blob size was statically specified at the initialization time and 
could not be extended.) 

Updates are totally ordered: if a snapshot is labeled by version k,
then its content reflects the successive application of all updates
1..k - 1 on the initial empty snapshot in this order. Thus generating
a new snapshot labeled with version k is semantically equivalent to
applying the update to a copy of the snapshot labeled with version
k - 1. As a convention, we will refer to the snapshot labeled with
version k simply by snapshot k from now on.



To create a new blob, one must call the CREATE primitive: 

id = CREATE( ) 

This primitive creates the blob and associates to it an empty snap- 
shot 0. The blob will be identified by its id (the returned value). 
The id is guaranteed to be globally unique. 

vw = WRITE(id, buffer, offset, size) 

A WRITE initiates the process of generating a new snapshot of the 
blob (identified by id) by replacing size bytes of the blob starting 
at offset with the contents of the local buffer. 

The WRITE does not know in advance which snapshot version it 
will generate, as the updates are totally ordered and internally man- 
aged by the storage system. However, after the primitive returns, 
the caller learns about its assigned snapshot version by consulting 
the returned value vw. The update will eventually be applied to the 
snapshot vw - 1, thus effectively generating the snapshot vw. This
snapshot version is said to be published when it becomes available 
to the readers. Note that the primitive may return before snapshot 
version vw is published. The publication time is unknown, but 
the WRITE is atomic in the sense of [9]: it appears to execute in- 
stantaneously at some point between its invocation and completion. 
Completion in our context refers to the moment in time when the 
newly generated snapshot vw is published. 

Finally, note that the WRITE primitive fails if the specified offset is 
larger than the total size of the snapshot vw - 1.

va = APPEND(id, buffer, size)

APPEND is a special case of WRITE, in which the offset is implicitly
assumed to be the size of snapshot va - 1.

READ(id, v, buffer, offset, size)

A READ results in replacing the contents of the local buffer with 
size bytes from the snapshot version v of the blob id, starting at off- 
set, if v has already been published. If v has not yet been published, 
the read fails. A read fails also if the total size of the snapshot v is 
smaller than offset + size. 

Note that the caller of the READ primitive must be able to learn 
about the new versions that are published in the system in order to 
provide a meaningful value for the v argument. The blob size corresponding
to snapshot v is also required, to enable valid ranges of the
blob to be read. The following primitives are therefore provided:

v = GET_RECENT(id)

This primitive returns a recently published version of the blob id. The
system guarantees that v ≥ max(vk), for all snapshot versions vk
published before the call.

size = GET_SIZE(id, v) 

This primitive returns the size of the blob snapshot corresponding 
to version v of the blob identified by id. The primitive fails if v has 
not been published yet. 

Since WRITE and APPEND may return before the corresponding 
snapshot version is published, a subsequent READ attempted by 
the same client on the very same snapshot version may fail. How- 
ever, it is desirable to be able to provide support for "read your 
writes" consistency. For this purpose, the following primitive is 
added: 



2.1 Interface 



SYNC(id, v) 



The caller of SYNC blocks until snapshot v of blob id is published. 

Our system also introduces support for branching, to allow alterna- 
tive evolutions of the blob through WRITE and APPEND starting 
from a specified version. 

bid = BRANCH(id, v) 

This primitive virtually duplicates the blob identified by id by cre- 
ating a new blob identified by bid. This new blob is identical to 
the original blob in every snapshot up to (and including) v. The 
first WRITE or APPEND on the blob bid will generate a new snapshot
v + 1 for blob bid. The primitive fails if version v of the blob
identified by id has not been published yet. 
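
The semantics of these primitives can be made concrete with a small single-process model. The sketch below is only an illustration and not the BlobSeer implementation: the class name ToyBlobStore and its method names are hypothetical, every snapshot is kept as a full byte string (whereas the actual design shares pages between snapshots), and updates are published immediately, so SYNC would be trivial and is omitted.

```python
import itertools


class ToyBlobStore:
    """Single-process model of the blob interface (illustrative only)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._snapshots = {}  # blob id -> list of snapshots, index = version

    def create(self):
        blob_id = next(self._ids)
        self._snapshots[blob_id] = [b""]  # snapshot 0 is empty
        return blob_id

    def write(self, blob_id, buffer, offset, size):
        snaps = self._snapshots[blob_id]
        prev = snaps[-1]
        if offset > len(prev):
            raise ValueError("offset is larger than the previous snapshot")
        # Build snapshot vw from a copy of snapshot vw - 1.
        new = prev[:offset] + buffer[:size] + prev[offset + size:]
        snaps.append(new)
        return len(snaps) - 1  # the assigned version vw

    def append(self, blob_id, buffer, size):
        # APPEND is a WRITE whose offset is the size of the previous snapshot.
        offset = len(self._snapshots[blob_id][-1])
        return self.write(blob_id, buffer, offset, size)

    def read(self, blob_id, v, offset, size):
        snaps = self._snapshots[blob_id]
        if v >= len(snaps):
            raise LookupError("version not published")
        if len(snaps[v]) < offset + size:
            raise ValueError("range exceeds snapshot size")
        return snaps[v][offset:offset + size]

    def get_recent(self, blob_id):
        return len(self._snapshots[blob_id]) - 1

    def get_size(self, blob_id, v):
        return len(self._snapshots[blob_id][v])

    def branch(self, blob_id, v):
        bid = next(self._ids)
        # The new blob shares every snapshot up to and including v.
        self._snapshots[bid] = list(self._snapshots[blob_id][:v + 1])
        return bid


store = ToyBlobStore()
blob = store.create()
v1 = store.append(blob, b"aaaa", 4)
v2 = store.write(blob, b"bb", 1, 2)
fork = store.branch(blob, v1)
v_fork = store.append(fork, b"cc", 2)
```

In this model, reading version v1 of the original blob still returns the bytes written by the first APPEND, even after the later WRITE and after the branch has evolved independently.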

2.2 Usage scenario 

Let us consider a simple motivating scenario illustrating the use of 
our proposed interface. A digital processing company offers online 
picture enhancing services for a wide user audience. Users upload
their picture, select a desired filter (such as sharpen), and download
their picture back. Most pictures taken with a modern camera
include some metadata in their header, describing attributes like 
camera type, shutter speed, ambient light levels, etc. Thousands of 
users upload pictures every day, and the company would like to an- 
alyze these pictures for statistical purposes. For example it might 
be interesting to find out the average contrast quality for each cam- 
era type. 

One option to address this problem would be to store the pictures 
in a huge database and perform some query when needed. Unfortu- 
nately, pictures are unstructured data: metadata is not standardized 
and may differ from one camera brand to another. Thus, no con- 
sistent schema can be designed for query optimization. Moreover, 
it is unfeasible to store variable binary data in a database, because 
database systems are usually fine-tuned for fixed-sized records. 

Let us now consider using a virtually unique (but physically distributed)
blob for the whole dataset. Pictures are APPENDed concurrently
to the blob from multiple sites serving the users, while a
recent version of the blob is processed at regular intervals: a set of 
workers READ disjoint parts of the blob, identify the set of pic- 
tures contained in their assigned part, extract from each picture the 
camera type and compute a contrast quality coefficient, and finally 
aggregate the contrast quality for each camera type. This type of 
computation fits in the class of map-reduce applications. The map 
phase generates a set of (key, value) pairs from the blob, while the 
reduce phase computes some aggregation function over all values 
corresponding to the same key. In our example the keys correspond 
to camera types. 
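
The computation described above can be sketched as a map phase followed by a reduce phase. The record layout (a camera type paired with a contrast coefficient) and the function names are assumptions made for this illustration; extracting real pictures from the blob is omitted.

```python
from collections import defaultdict


def map_phase(pictures):
    """Emit one (camera type, contrast) pair per picture in a blob part."""
    for camera, contrast in pictures:
        yield camera, contrast


def reduce_phase(pairs):
    """Average the contrast values that share the same camera type."""
    groups = defaultdict(list)
    for camera, contrast in pairs:
        groups[camera].append(contrast)
    return {camera: sum(vals) / len(vals) for camera, vals in groups.items()}


# Two workers, each assigned a disjoint part of the blob (hypothetical data).
part_a = [("cam-x", 0.5), ("cam-y", 0.25)]
part_b = [("cam-x", 1.0)]
pairs = list(map_phase(part_a)) + list(map_phase(part_b))
averages = reduce_phase(pairs)
```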

Many times during a map phase it may be necessary to overwrite 
parts of the blob. For example, if complex image processing was
necessary for some pictures, overwriting each picture with its
processed version saves computation time when processing future
blob versions. Surely, a map with an idempotent reduce reaches 
the same result with no need to write, but at the cost of creating an 
output that duplicates the blob, which means an unacceptable loss 
of storage space. 

3. DESIGN OVERVIEW 

Our system is striping-based: a blob is made up of blocks of a fixed 
size psize, referred to as pages. Each page is assigned to a fixed 
range of the blob [k × psize, (k + 1) × psize - 1]. Any range that
covers a full number of pages is said to be aligned. These pages
are distributed among storage space providers. Metadata facilitates
access to a range (offset, size) for any existing version of a blob 
snapshot, by associating such a range with the page providers. 

A WRITE or APPEND generates a new set of pages corresponding 
to the offset and size requested to be updated. Metadata is then 
generated and "weaved" together with the old metadata in such a way
as to create the illusion of a new incremental snapshot that actually 
shares the unmodified pages with the older versions. Thus, two 
successive snapshots v and v+1 physically share the pages that fall 
outside of the range of the update that generated snapshot v + 1. 

Consider a read for snapshot v whose range fits exactly a single 
page. The physical page that is accessed was produced by some 
update that generated snapshot w, with w ≤ v such that w is the
highest snapshot version generated by an update whose range in- 
tersects the page. Therefore, when the range of a READ covers 
several pages, these pages may have been generated by different 
updates. Updates that do not cover full pages are handled in a 
slightly more complex way, but not discussed here, due to space 
constraints. 
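
The rule for locating the physical page can be written down directly. In the sketch below, aligned updates are recorded as hypothetical (version, first_page, page_count) tuples and the function name is ours; it returns the version of the update that produced the page visible in snapshot v. The search includes the update that generated snapshot v itself.

```python
def producing_version(updates, page, v):
    """Return the version of the update whose page is visible in snapshot v.

    `updates` is a list of (version, first_page, page_count) tuples.
    Returns None if no update up to v touched the page.
    """
    candidates = [
        version
        for version, first, count in updates
        if version <= v and first <= page < first + count
    ]
    return max(candidates, default=None)


# Snapshot 1 writes pages 0..3, snapshot 2 overwrites pages 1..2,
# snapshot 3 appends page 4.
history = [(1, 0, 4), (2, 1, 2), (3, 4, 1)]
```

With this history, page 1 read from snapshot 3 comes from update 2, while page 0 still comes from update 1.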

3.1 Architecture overview 

Our distributed service consists of communicating processes, each 
fulfilling a particular role.

Clients may create blobs and read, write and append data to them. 
There may be multiple concurrent clients, and their number 
may dynamically vary in time. 

Data providers physically store the pages generated by WRITE 
and APPEND. New data providers may dynamically join 
and leave the system. 

The provider manager keeps information about the available stor- 
age space. Each joining provider registers with the provider 
manager. The provider manager decides which providers 
should be used to store the generated pages according to 
a strategy aiming at ensuring an even distribution of pages 
among providers. 

The metadata provider physically stores the metadata allowing 
clients to find the pages corresponding to the blob snapshot 
version. Note that the metadata provider may be implemented 
in a distributed way. However, for the sake of readability,
we do not develop this aspect in our presentation of the al- 
gorithms we propose for data access. Distributed metadata 
management is addressed in detail in Section 4. 

The version manager is the key actor of the system. It registers
update requests (APPEND and WRITE), assigning snapshot
version numbers, and eventually publishes these updates,
guaranteeing total ordering and atomicity.



Our design targets scalability and large-scale distribution. There- 
fore, we make a key design choice in avoiding a static role distribu- 
tion: any physical node may play one or multiple roles, as a client, 
or by hosting data or metadata. This scheme makes our system 
suitable for a P2P environment. 

3.2 Reading data 



The READ primitive is presented in Algorithm 1. The client first
contacts the version manager to check whether the supplied version
v has been published, and fails if this is not the case. Otherwise,
the client needs to find out which pages fully cover the requested offset
and size for version v and where they are stored. For this purpose,
the client contacts the metadata provider and receives the required 
metadata. Then it processes the metadata to generate a set of page 
descriptors PD. PD holds information about all pages that need 
to be fetched: for each page its globally unique page id pid, its 
index i in the buffer to be read and the page provider that stores 
it. (Note that, for the sake of simplicity, we consider here the case 
where each page is stored on a single provider. Replication strate- 
gies will be investigated in future work.) Having this information 
assembled, the client fetches the pages in parallel and fills the local 
buffer. Note that the range defined by the supplied offset and size 
may not be aligned to full pages. In this case the client may request 
only a part of the page from the page provider. 



Algorithm 1 READ
Require: The snapshot version v
Require: The local buffer to read to
Require: The offset in the blob
Require: The size to read
    if v is not published then
        fail
    end if
    PD ← READ_METADATA(v, offset, size)
    for all (pid, i, provider) ∈ PD in parallel do
        read pid from provider into buffer at i × psize
    end for
    return success
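
A minimal sketch of Algorithm 1 follows, with in-memory dictionaries standing in for the page providers and a thread pool standing in for parallel network fetches. The names read_blob and providers and the tiny PSIZE are assumptions made for the example. The publication check is omitted, only aligned reads are handled, and the page descriptors are passed in directly instead of being obtained from the metadata provider.

```python
from concurrent.futures import ThreadPoolExecutor

PSIZE = 4  # page size in bytes, deliberately tiny for the example


def read_blob(page_descriptors, providers, size):
    """Fetch pages in parallel and assemble them into one buffer.

    page_descriptors: iterable of (pid, i, provider) tuples, where i is the
    index of the page inside the buffer being read.
    providers: dict provider name -> dict pid -> page bytes.
    """
    buffer = bytearray(size)

    def fetch(descriptor):
        pid, i, provider = descriptor
        page = providers[provider][pid]
        # Each page lands in its own slice, so the workers never overlap.
        buffer[i * PSIZE:(i + 1) * PSIZE] = page

    with ThreadPoolExecutor() as pool:
        list(pool.map(fetch, page_descriptors))  # list() surfaces exceptions
    return bytes(buffer)


providers = {
    "p1": {"pid-a": b"AAAA", "pid-c": b"CCCC"},
    "p2": {"pid-b": b"BBBB"},
}
pd = [("pid-a", 0, "p1"), ("pid-b", 1, "p2"), ("pid-c", 2, "p1")]
data = read_blob(pd, providers, 3 * PSIZE)
```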



Note that, at this stage, for readability reasons, we have not yet
developed the details of metadata management. However, the key
mechanism that enables powerful properties such as efficient fine-grain
access under heavy concurrency relates directly to metadata
management, as discussed in Section 4.

3.3 Writing and appending data 

Algorithm 2 describes how the WRITE primitive works. For sim- 
plicity, we first consider here aligned writes only, with page size 
psize. Unaligned writes are also handled by our system, but, due 
to space constraints, this case is not discussed here. The client first 
needs to determine the number of pages n that cover the range. 
Then, it contacts the provider manager requesting a list of n page 
providers PP (one for each page) that are capable of storing the 
pages. For each page in parallel, the client generates a globally 
unique page id pid, contacts the corresponding page provider and 
stores the contents of the page on it. It then updates the set of 
page descriptors PD accordingly. This set is later used to build 
the metadata associated with this update. After successful com- 
pletion of this stage, the client contacts the version manager and 
registers its update. The version manager assigns to this update a 
new snapshot version vw and communicates it to the client, which 
then generates new metadata and "weaves" it (details in Section 4)
together with the old metadata such that the new snapshot vw ap- 
pears as a standalone entity. Finally it notifies the version manager 
of success, and returns successfully to the user. At this point, the 
version manager takes the responsibility of eventually publishing 
vw. 

APPEND is almost identical to WRITE, with the difference
that the offset is directly provided by the version manager at the
time when the snapshot version is assigned. This offset is the size of
the previously published snapshot version.

Algorithm 2 WRITE
Require: The local buffer used to apply the update.
Require: The offset in the blob.
Require: The size to write.
Ensure: The assigned version vw to be published.
    n ← (offset + size) / psize
    PP ← the list of n page providers
    PD ← ∅
    for all 0 ≤ i < n in parallel do
        pid ← unique page id
        provider ← PP[i]
        store page pid from buffer at i × psize to provider
        PD ← PD ∪ (pid, i, provider)
    end for
    vw ← assigned snapshot version
    BUILD_METADATA(vw, offset, size, PD)
    notify version manager of success
    return vw

Note that our algorithm enables a high degree of parallelism: for 
any update (WRITE or APPEND), pages may be asynchronously 
sent and stored in parallel on providers. Moreover, multiple clients 
may perform such operations with full parallelism: no synchronization
is needed for writing the data, since each update generates
new pages. Some synchronization is necessary when writing the
metadata; however, the induced overhead is low (see Section 4).
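
The data path of a WRITE can be sketched in the same spirit. Everything named here (SimpleVersionManager, write_blob, the round-robin provider choice) is a hypothetical stand-in. The sketch takes n as size / psize, the number of pages in an aligned range, leaves out metadata construction, and does not model publication.

```python
import itertools
import uuid
from concurrent.futures import ThreadPoolExecutor

PSIZE = 4  # page size in bytes, deliberately tiny for the example


class SimpleVersionManager:
    """Hands out snapshot versions in a total order (illustrative only)."""

    def __init__(self):
        self._versions = itertools.count(1)

    def register_update(self, offset, size):
        return next(self._versions)


def write_blob(buffer, offset, size, providers, version_manager):
    """Store an aligned update as new pages; return (vw, page descriptors)."""
    n = size // PSIZE  # number of pages covered by the aligned range
    names = sorted(providers)
    # Stand-in for the provider manager: spread pages round-robin.
    chosen = [names[i % len(names)] for i in range(n)]

    def store(i):
        pid = uuid.uuid4().hex  # globally unique page id
        providers[chosen[i]][pid] = bytes(buffer[i * PSIZE:(i + 1) * PSIZE])
        return pid, i, chosen[i]

    with ThreadPoolExecutor() as pool:
        descriptors = list(pool.map(store, range(n)))
    # The update is registered only after all pages have been stored.
    vw = version_manager.register_update(offset, size)
    return vw, descriptors


providers = {"p1": {}, "p2": {}}
vm = SimpleVersionManager()
vw, pd = write_blob(b"AAAABBBBCCCC", 0, 12, providers, vm)
```

Since every page receives a fresh id, concurrent callers of write_blob never touch the same stored page, which is why no synchronization is needed on the data path.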

4. METADATA MANAGEMENT 

Metadata stores information about the pages which make up a given 
blob, for each generated snapshot version. We choose a simple, yet 
versatile design, allowing the system to efficiently build a full view 
of the new snapshot of the blob each time an update occurs. This 
is made possible through a key design choice: when updating data, 
new metadata is created, rather than updating old metadata. As 
we will explain below, this decision significantly helps us provide 
support for heavy concurrency, as it favors independent concurrent 
accesses to metadata without synchronization. 

4.1 The distributed metadata tree 

We organize metadata as a distributed segment tree [19], one asso- 
ciated to each snapshot version of a given blob id. A segment tree 
is a binary tree in which each node is associated to a range of the 
blob, delimited by offset and size. We say that the node covers the 
range (offset, size). For each node that is not a leaf, the left child
covers the first half of the range, and the right child covers the sec- 
ond half. Each leaf covers a single page. We assume the page size 
psize is a power of two. 

For example, Figure 1(a) depicts the structure of the metadata for 
a blob consisting of four pages. We assume the page size is 1.
The root of the tree covers the range (0, 4) (i.e., offset = 0, size = 
4 pages), while each leaf covers exactly one page in this range. 

Tree nodes are stored on the metadata provider in a distributed way, 
using a simple DHT (Distributed Hash Table). This choice favors 
concurrent access to metadata, as explained in Section 4.2. Each 
tree node is identified uniquely by its version and range specified 
by the offset and size it covers. Inner nodes hold the version of the
left child vl and the version of the right child vr, while leaves hold
the page id pid and the provider that stores the page.

Figure 1: Metadata representation. (a) The metadata after a write
of four pages. (b) The metadata after overwriting two pages.
(c) The metadata after an append of one page.
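
As a rough illustration of this layout, the sketch below models tree nodes as small records keyed by (version, offset, size) and maps each key to a metadata provider by hashing. The class names and the SHA-1 based placement are assumptions; the text only states that a simple DHT is used. Ranges are counted in pages.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeKey:
    version: int
    offset: int  # in pages
    size: int    # in pages


@dataclass
class InnerNode:
    vl: int  # version of the left child
    vr: int  # version of the right child


@dataclass
class LeafNode:
    pid: str       # page id
    provider: str  # provider that stores the page


def dht_slot(key, n_slots):
    """Map a node key to one of n_slots metadata providers."""
    text = f"{key.version}:{key.offset}:{key.size}".encode()
    return int.from_bytes(hashlib.sha1(text).digest()[:8], "big") % n_slots


def child_keys(key, node):
    """Keys of the two children of an inner node."""
    half = key.size // 2
    return (NodeKey(node.vl, key.offset, half),
            NodeKey(node.vr, key.offset + half, half))


root_key = NodeKey(version=2, offset=0, size=4)
left, right = child_keys(root_key, InnerNode(vl=2, vr=2))
```

Because a child key is fully determined by the parent's key and the stored child version, a client can compute where to fetch each child without consulting any central index.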



Sharing metadata across snapshot versions. Such a meta- 
data tree is created when the first pages of the blob are written, for 
the range covered by those pages. Note that rebuilding a full tree 
for subsequent updates would be space- and time-inefficient. This 
can be avoided by sharing the existing tree nodes that cover the 
blob ranges which do not intersect with the range of the update to 
be processed. Of course, new tree nodes are created for the ranges 
that do intersect with the range of the update. These new tree nodes 
are "weaved" with existing tree nodes generated by past updates, 
in order to build a new consistent view of the blob, corresponding
to a new snapshot version. This process is illustrated in Figures
1(a) and 1(b). Figure 1(a) corresponds to an initial snapshot
version (1) of a 4-page blob, whereas Figure 1(b) illustrates how 
metadata evolves when pages 2 and 3 of this blob are modified 
(snapshot version 2). Versions are color-coded: the initial snapshot 
1 is white, snapshot 2 is grey. When a WRITE updates the second 
and third page of the blob, the grey nodes are generated: (1, 1),
(2, 1), (0, 2), (2, 2), (0, 4). These new grey nodes are "weaved"
with existing white nodes corresponding to the unmodified pages 1 
and 4. Therefore, the left child of the grey node that covers (0, 2) 
is the white node that covers (0,1); similarly, the right child of the 
grey node that covers (2, 2) is the white node covering (3, 1). 
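
The set of tree nodes created by an update can be enumerated mechanically. The helper below is a sketch under the assumptions of the example (ranges expressed in pages, a root size that is a power of two); the function name is ours. It walks up from every written leaf to the root and collects the (offset, size) ranges that must be created.

```python
def new_node_ranges(first_page, page_count, root_size):
    """Return the (offset, size) ranges of the nodes an update must create."""
    ranges = set()
    for page in range(first_page, first_page + page_count):
        offset, size = page, 1
        while True:
            ranges.add((offset, size))
            if size == root_size:
                break
            size *= 2
            offset -= offset % size  # align to the parent's range
    return ranges


# Overwriting the second and third page of a four-page blob.
created = new_node_ranges(first_page=1, page_count=2, root_size=4)
```

For the overwrite of the example, the helper yields exactly the five grey nodes listed above.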

Expanding the metadata tree. APPEND operations make
the blob "grow": consequently, the metadata tree gets expanded,
as illustrated in Figure 1(c). Continuing our example, we assume
that the WRITE generating snapshot version 2 is followed by an
APPEND for one page, which generates snapshot version 3 of the
blob (black-colored). New metadata tree nodes are generated, to
take into account the creation of the fifth page. The left child of the
new black root (0, 8) is the old, grey root of snapshot 2, (0, 4).
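
The size of the root after an append follows from the power-of-two structure of the tree. A small sketch, with a hypothetical helper name and sizes counted in pages:

```python
def root_size(blob_pages):
    """Smallest power of two that covers blob_pages pages (at least 1)."""
    size = 1
    while size < blob_pages:
        size *= 2
    return size
```

In the example, a four-page blob has a root covering 4 pages, while a five-page blob needs a root covering 8 pages, so the append creates a new root (0, 8) whose left child is the previous root (0, 4).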

4.2 Accessing metadata: algorithms 

Reading metadata. During a READ, the metadata is accessed
(Algorithm 3) in order to find out which pages fully cover the requested
range R delimited by offset and size. It is therefore necessary
to traverse down the segment tree, starting from the root that
corresponds to the requested snapshot version. A node N that covers
segment R_N is explored if the intersection of R_N with R is
not empty. All explored leaves reached this way are used to build
the set of page descriptors PD that is used to fetch the contents of
the pages. To simplify the presentation of the algorithm, we introduce
two primitives. GET_NODE(v, offset, size) fetches and
returns the contents of the node identified by the supplied version,
offset and size from the metadata provider. Similarly, GET_ROOT(v)
fetches and returns the root of the tree corresponding to version v.



Algorithm 3 READ_META
Require: The snapshot version v
Require: The offset of the blob
Require: The size to read
Ensure: The set of page descriptors PD
    NS ← {GET_ROOT(v)}
    while NS ≠ ∅ do
        N ← extract node from NS
        if N is leaf then
            i ← (N.offset - offset) / psize
            PD ← PD ∪ (N.pid, i, N.provider)
        else
            if (offset, size) intersects (N.offset, N.size/2) then
                NS ← NS ∪ GET_NODE(N.vl, N.offset, N.size/2)
            end if
            if (offset, size) intersects (N.offset + N.size/2, N.size/2) then
                NS ← NS ∪ GET_NODE(N.vr, N.offset + N.size/2, N.size/2)
            end if
        end if
    end while
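
Algorithm 3 translates almost line by line into Python. In the sketch below the metadata provider is replaced by a plain dictionary keyed by (version, offset, size), with ranges counted in pages. The dictionary contents reproduce the two-snapshot example of Figure 1(b); the page ids and the provider name are invented for the illustration.

```python
def intersects(offset, size, n_offset, n_size):
    """True if range (offset, size) overlaps range (n_offset, n_size)."""
    return offset < n_offset + n_size and n_offset < offset + size


def read_meta(store, root_key, offset, size):
    """Return [(pid, i, provider)] for the pages covering (offset, size)."""
    pd = []
    pending = [root_key]
    while pending:
        key = pending.pop()
        _, n_offset, n_size = key
        node = store[key]
        if "pid" in node:  # leaf
            pd.append((node["pid"], n_offset - offset, node["provider"]))
            continue
        half = n_size // 2
        if intersects(offset, size, n_offset, half):
            pending.append((node["vl"], n_offset, half))
        if intersects(offset, size, n_offset + half, half):
            pending.append((node["vr"], n_offset + half, half))
    return sorted(pd, key=lambda d: d[1])


def leaf(pid):
    return {"pid": pid, "provider": "prov"}


store = {
    # Snapshot 1: a full tree over four pages.
    (1, 0, 4): {"vl": 1, "vr": 1},
    (1, 0, 2): {"vl": 1, "vr": 1},
    (1, 2, 2): {"vl": 1, "vr": 1},
    (1, 0, 1): leaf("a1"), (1, 1, 1): leaf("b1"),
    (1, 2, 1): leaf("c1"), (1, 3, 1): leaf("d1"),
    # Snapshot 2: only the nodes on the path to the rewritten pages.
    (2, 0, 4): {"vl": 2, "vr": 2},
    (2, 0, 2): {"vl": 1, "vr": 2},
    (2, 2, 2): {"vl": 2, "vr": 1},
    (2, 1, 1): leaf("b2"), (2, 2, 1): leaf("c2"),
}
pages_v2 = [pid for pid, _, _ in read_meta(store, (2, 0, 4), 0, 4)]
pages_v1 = [pid for pid, _, _ in read_meta(store, (1, 0, 4), 0, 4)]
```

Reading snapshot 2 mixes pages from both updates, while snapshot 1 remains fully readable through its own root.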



Writing metadata. For each update (WRITE or APPEND) producing
snapshot version vw, it is necessary to build a new metadata
tree (possibly sharing nodes with the trees corresponding to previous
snapshot versions). This new tree is the smallest (possibly
incomplete) binary tree such that its leaves are exactly the leaves
covering the pages of the range that is written. The tree is built
bottom-up: first the leaves corresponding to the newly generated pages are
built, then the inner nodes P are built up to (and including) the root.
This process is illustrated in Algorithm 4. Note that inner nodes
may have children which do not intersect the range of the update
to be processed. For any given snapshot version vw, these nodes
form the set of border nodes B_vw. When building the metadata
tree, the algorithm needs to compute the corresponding versions of
such children nodes (vl or vr). For simplicity, we do not detail
here how the set of border nodes is computed before building the
tree.

Algorithm 4 BUILD_META
Require: The assigned snapshot version vw.
Require: The offset in the blob.
Require: The size to write.
Require: The set of page descriptors PD.
Ensure:
    Q ← ∅
    B_vw ← build the set of border nodes
    for all (pid, i, provider) ∈ PD do
        N ← NEW_NODE(vw, offset + i × psize, psize)
        N.pid ← pid
        N.provider ← provider
        Q ← Q ∪ {N}
    end for
    while Q ≠ ∅ do
        N ← extract a node from Q
        if N is not root then
            if N.offset % (2 × N.size) = 0 then
                P ← NEW_NODE(vw, N.offset, 2 × N.size)
                position ← LEFT
            else
                P ← NEW_NODE(vw, N.offset - N.size, 2 × N.size)
                position ← RIGHT
            end if
            if P ∉ V then
                if position = LEFT then
                    P.vl ← vw
                    P.vr ← extract right child version from B_vw
                end if
                if position = RIGHT then
                    P.vr ← vw
                    P.vl ← extract left child version from B_vw
                end if
                Q ← Q ∪ {P}
                V ← V ∪ {P}
            end if
        end if
    end while
    for all N ∈ V in parallel do
        write N to the metadata provider
    end for
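
A sequential sketch of Algorithm 4 is given below. It makes several simplifying assumptions that go beyond the text: ranges are counted in pages, the metadata provider is a dictionary, leaves are written to the store together with the inner nodes, a sibling that is also covered by the update gets version vw, the root grows by at most one level per update, and border nodes are resolved by descending the tree of the previous snapshot (so concurrent writers are not handled). All names are illustrative.

```python
def border_version(store, prev_root, offset, size):
    """Version of the node covering (offset, size) in the previous snapshot.

    Returns None when the range lies beyond the previous tree.
    """
    if prev_root is None:
        return None
    version, n_offset, n_size = prev_root
    if n_offset + n_size <= offset:
        return None
    if size > n_size:
        raise ValueError("sketch assumes the root grows by one level at most")
    while (n_offset, n_size) != (offset, size):
        if version is None:
            return None
        node = store[(version, n_offset, n_size)]
        n_size //= 2
        if offset < n_offset + n_size:
            version = node["vl"]
        else:
            version, n_offset = node["vr"], n_offset + n_size
    return version


def build_meta(store, prev_root, vw, first_page, descriptors, root_size):
    """Create the nodes of snapshot vw and return the key of its root."""
    count = len(descriptors)
    queue = []
    for pid, i, provider in descriptors:
        key = (vw, first_page + i, 1)
        store[key] = {"pid": pid, "provider": provider}
        queue.append(key)
    while queue:
        _, offset, size = queue.pop()
        if size == root_size:          # reached the root
            continue
        if offset % (2 * size) == 0:   # this node is a left child
            parent, s_offset = (vw, offset, 2 * size), offset + size
        else:                          # this node is a right child
            parent, s_offset = (vw, offset - size, 2 * size), offset - size
        if parent in store:
            continue
        # The sibling is new if the update covers it, else it is a border node.
        if s_offset < first_page + count and first_page < s_offset + size:
            s_version = vw
        else:
            s_version = border_version(store, prev_root, s_offset, size)
        if offset < s_offset:
            store[parent] = {"vl": vw, "vr": s_version}
        else:
            store[parent] = {"vl": s_version, "vr": vw}
        queue.append(parent)
    return (vw, 0, root_size)


store = {}
r1 = build_meta(store, None, 1, 0,
                [("a1", 0, "p"), ("b1", 1, "p"), ("c1", 2, "p"), ("d1", 3, "p")], 4)
r2 = build_meta(store, r1, 2, 1, [("b2", 0, "p"), ("c2", 1, "p")], 4)
r3 = build_meta(store, r2, 3, 4, [("e3", 0, "p")], 8)
```

The three calls replay the scenario of Figure 1: an initial write of four pages, an overwrite of the second and third page, and an append of a fifth page.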



Why WRITEs and APPENDs may proceed in parallel.
Building new metadata tree nodes might seem to require serialization.
Consider two concurrent clients C1 and C2. Let us assume
that, after having written their pages in parallel, with no synchronization,
they contact the version manager to get their snapshot versions.
Let us assume C1 gets snapshot version vw and C2 gets
snapshot version vw + 1. The two clients should then start to build
their metadata tree nodes concurrently. However, it may seem that
client C2 must wait for client C1 to complete writing metadata, because
tree nodes built by C1 may actually be part of the set of border
nodes of C2, which is used by C2 to build its own tree nodes.

As our goal is to favor concurrent WRITEs and APPENDs (and,
consequently, concurrent metadata writing), we choose to avoid
such a serialization by introducing a small computation overhead.
Note that C2 may easily compute the border node set B2 by
descending the tree (starting from the root) corresponding to snapshot
vw generated by C1. It may thus gather all left and right children
of the nodes which intersect the range of the update corresponding
to snapshot vw + 1. If the root of snapshot vw + 1 covers a larger
range than the root of snapshot vw, then the set of border nodes
contains exactly one node: the root of snapshot vw. Our main
difficulty comes from the nodes built by C1 that can actually be
part of the set of border nodes of C2; all other nodes of the set of
border nodes of C2 can be computed as described above, by using the
root of the latest published snapshot vp instead of the root of vw.
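
The descent described above can be sketched as follows. This is a
minimal illustration in Python, not the BlobSeer implementation. It
tracks only page ranges (not snapshot versions), assumes that the root
covers a power-of-two number of pages, and takes a border node to be a
child that lies outside the update range while its parent intersects
it. The names intersects and border_nodes are hypothetical.

```python
def intersects(a_off, a_size, b_off, b_size):
    """True if the half-open page ranges [off, off + size) overlap."""
    return a_off < b_off + b_size and b_off < a_off + a_size


def border_nodes(root_off, root_size, upd_off, upd_size):
    """Descend from the root and collect the children left untouched by the update.

    Ranges are expressed in pages; root_size is assumed to be a power of two
    and the update is assumed to intersect the root.
    """
    border = []
    stack = [(root_off, root_size)]
    while stack:
        off, size = stack.pop()
        if size == 1:
            continue  # leaf: covers a single page, no children
        half = size // 2
        for child in ((off, half), (off + half, half)):
            if intersects(child[0], child[1], upd_off, upd_size):
                stack.append(child)   # rebuilt by the update, keep descending
            else:
                border.append(child)  # untouched subtree, shared with the older snapshot
    return sorted(border)


# Update of pages 3 and 4 in a tree whose root covers pages 0..7.
print(border_nodes(0, 8, 3, 2))
```

In this sketch, every node on the path to an updated page is rebuilt,
while each returned range stands for a subtree that the new snapshot
can reference without copying.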

Our solution to this is to introduce a small computation overhead
in the version manager, which will supply the problematic tree nodes
that are part of the set of border nodes directly to the writer at the
moment it is assigned a new snapshot version. This is possible
because the range of each concurrent WRITE or APPEND is registered
by the version manager. Such operations are said to be concurrent
with the update being processed if they have been assigned a version
number (after writing their data), but have not been published yet
(e.g. because they have not finished writing the corresponding
metadata). By iterating through the concurrent WRITE and APPEND
operations (which have been assigned a lower snapshot version), the
version manager will build the partial set of border nodes and
provide it to the writer when it asks for the snapshot version. The
version manager also supplies a recently published snapshot version
that can be used by the writer to compute the rest of the border
nodes. Armed with both the partial set of border nodes and a
published snapshot version, the writer is now able to compute the
full set of border nodes with respect to the supplied snapshot
version.
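
A rough sketch of this exchange is given below, under several
assumptions that are ours and not stated in the text: versions are
consecutive integers starting at 1, a tree node is rewritten by every
update whose range intersects the node's range, and, for brevity, the
candidate border ranges are passed in by the caller instead of being
derived by the version manager. The class ToyVersionManager and its
assign method are hypothetical names.

```python
import itertools


def intersects(a_off, a_size, b_off, b_size):
    """True if the half-open page ranges [off, off + size) overlap."""
    return a_off < b_off + b_size and b_off < a_off + a_size


class ToyVersionManager:
    """Single-process stand-in for the version manager (hypothetical API)."""

    def __init__(self):
        self.published = 0           # latest published snapshot version
        self._counter = itertools.count(1)
        self.in_flight = {}          # version -> (offset, size), assigned but unpublished

    def assign(self, offset, size, border_ranges):
        """Assign a version and resolve border nodes touched by concurrent updates.

        Returns (version, partial_border, published). partial_border maps a
        border range to the highest concurrent lower version that rewrote it;
        ranges absent from it must be looked up in the published snapshot.
        """
        version = next(self._counter)
        partial_border = {}
        for rng in border_ranges:
            touching = [v for v, (o, s) in self.in_flight.items()
                        if intersects(rng[0], rng[1], o, s)]
            if touching:
                partial_border[rng] = max(touching)
        self.in_flight[version] = (offset, size)  # register the new update's range
        return version, partial_border, self.published


vm = ToyVersionManager()
v1, _, _ = vm.assign(0, 4, [(4, 4)])                  # first writer, pages 0..3
v2, partial, pub = vm.assign(4, 2, [(0, 4), (6, 2)])  # second writer, pages 4..5
print(v2, partial, pub)
```

Here the second writer learns that the border node covering pages 0..3
is being produced by the unpublished update 1, and falls back to the
published snapshot for the node covering pages 6..7.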

4.3 Discussion 

Our approach enables efficient versioning both in terms of
performance under heavy load and in terms of required storage space.
Below we discuss some of the properties of our system.



Support for heavy access concurrency. Given that updates
always generate new pages instead of overwriting older pages, READ,
WRITE and APPEND primitives called by concurrent clients may
fully proceed in parallel at the application level, with no need for
explicit synchronization. This is a key feature of our distributed
algorithm. Internally, synchronization is kept minimal: distinct pages
may be read or updated in a fully parallel way; data access
serialization is only necessary when the same provider is contacted at
the same time by different clients, either for reading or for writing.
It is important to note that the strategy employed by the provider
manager for page-to-provider distribution plays a central role in
minimizing such conflicts that lead to serialization.

Note, on the other hand, that internal serialization is necessary when
two updates (WRITE or APPEND) contact the version manager
simultaneously to obtain a snapshot version. This step is, however,
negligible when compared to the full operation.

Finally, the scheme we use for metadata management also aims
at enabling parallel access to metadata as much as possible. The
situations where synchronization is necessary have been discussed
in Section 4.2.



Efficient use of storage space. Note that new storage space is 
necessary for newly written pages only: for any WRITE or APPEND, 
the pages that are NOT updated are physically shared by the newly 
generated snapshot version with the previously published version. 
This way, the same physical page may be shared by a large number 
of snapshot versions of the same blob. Moreover, as explained in 
Section 4, multiple snapshot versions may partially share metadata. 
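
The sharing of pages between snapshot versions can be illustrated
with a deliberately simplified model. The sketch below represents a
snapshot as a flat tuple of page identifiers, which is an assumption
made for brevity (the scheme described in this paper uses a
distributed segment tree, which additionally shares metadata nodes).
The function apply_write and the page identifiers are hypothetical.

```python
def apply_write(snapshot, first_page, new_page_ids):
    """Return a new snapshot that references every untouched page of the old one.

    snapshot is a tuple of page ids. An APPEND is modelled as a write whose
    first_page equals len(snapshot). The old snapshot is never modified.
    """
    if not 0 <= first_page <= len(snapshot):
        raise ValueError("write must start inside the blob or at its end")
    pages = list(snapshot)
    end = first_page + len(new_page_ids)
    pages[first_page:end] = new_page_ids  # only these entries point to new pages
    return tuple(pages)


v1 = apply_write((), 0, ["p0", "p1", "p2", "p3"])  # initial 4-page blob
v2 = apply_write(v1, 1, ["p4", "p5"])              # WRITE over pages 1..2
v3 = apply_write(v2, 4, ["p6"])                    # APPEND of one page

stored = set(v1) | set(v2) | set(v3)
print(len(stored), "physical pages back", len(v1) + len(v2) + len(v3), "page references")
```

In this toy model, three snapshots holding 13 page references in total
are backed by 7 physical pages, because each update stores only the
pages it actually writes.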



Atomicity. Recent arguments [9, 10, 16] stress the need to
provide atomicity for operations on objects. An atomic storage
algorithm must guarantee that any read or write operation appears to
execute instantaneously between its invocation and completion despite
concurrent access from any number of clients. In our architecture, the
version manager is responsible for assigning snapshot versions and
for publishing them upon successful completion of WRITEs and
APPENDs. Note that concurrent WRITEs and APPENDs work
in complete isolation, as they do not modify, but rather add data
and metadata. It is then up to the version manager to decide when
their effects will be revealed to the other clients, by publishing their
assigned versions in a consistent way. The only synchronization
occurs at the level of the version manager. In our current
implementation, atomicity is easy to achieve, as the version manager is
centralized. Using a distributed version manager will be addressed
in the near future.
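
One possible reading of "publishing in a consistent way" is that a
snapshot version becomes visible only once all lower versions have
completed. The sketch below illustrates that policy. The in-order
rule, the class name PublicationLog and the convention that version 0
denotes the empty blob are assumptions of this example, not details
given in the text.

```python
class PublicationLog:
    """Tracks completed updates and publishes them in version order (hypothetical)."""

    def __init__(self):
        self.published = 0       # highest version visible to readers
        self._completed = set()  # versions whose data and metadata are fully written

    def complete(self, version):
        """Record that an update has finished; return the newly visible version."""
        self._completed.add(version)
        # Advance only over a contiguous run of completed versions.
        while self.published + 1 in self._completed:
            self.published += 1
            self._completed.remove(self.published)
        return self.published


log = PublicationLog()
print(log.complete(2))  # version 1 is still in progress, nothing is revealed
print(log.complete(1))  # versions 1 and 2 are revealed together
```

Readers that only ever see the published version therefore observe
each update either entirely or not at all.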



5. EXPERIMENTAL EVALUATION 

We evaluated the approach developed above within the framework of
our BlobSeer prototype. To implement the metadata provider in a
distributed way, we have developed a custom DHT (Distributed Hash
Table), based on a simple static distribution scheme. This allows
metadata to be efficiently stored and retrieved in parallel.
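
The text does not detail the static distribution scheme. One common
way to obtain such a scheme is to hash each key and take the result
modulo the (fixed) number of providers, as sketched below. The key
layout, the choice of SHA-1 and the function name provider_for are
assumptions made for illustration only.

```python
import hashlib


def provider_for(key: str, n_providers: int) -> int:
    """Map a metadata key to a provider index with a fixed hash-modulo rule."""
    if n_providers < 1:
        raise ValueError("at least one provider is required")
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_providers


# Hypothetical key layout for a tree node: blob id, version, offset, size.
key = "blob42:7:0:1024"
print(provider_for(key, 173))
```

Because the mapping depends only on the key and on the number of
providers, any client can locate a tree node without contacting a
central directory, at the cost of remapping keys if the set of
providers changes.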

Evaluations have been performed on the Grid'5000 [2] testbed, an
experimental Grid platform gathering 9 sites geographically
distributed in France. In each experiment, we used at most 175 nodes
of the Rennes site of Grid'5000. Nodes are outfitted with x86_64
CPUs and 4 GB of RAM. Intracluster bandwidth is 1 Gbit/s
(measured: 117.5 MB/s for TCP sockets with MTU = 1500 B), latency
is 0.1 ms.

We first ran a set of experiments to evaluate the impact of our
metadata scheme on the performance of the APPEND operation (in
terms of bandwidth) as the blob size continuously grows. In
these experiments, a single client process creates an empty blob
and starts appending to it, while we constantly monitor the
bandwidth of the APPEND operation.

We use two deployment settings. We deploy the version manager
and the provider manager on two distinct dedicated nodes, and
we co-deploy a data provider and a metadata provider on the other
nodes, using a total of 50 data and metadata providers in the first
setting and 175 data and metadata providers in the second setting.
In each of the two settings, a client creates a blob and starts
appending 64 MB of data to the blob. This process is repeated twice,
using a different page size each time: 64 KB and 256 KB.



Results are shown in Figure 2(a): they show that a high bandwidth 
is maintained even when the blob grows to large sizes, thus demon- 
strating a low metadata overhead. A slight bandwidth decrease is 
observed when the number of pages reaches a power of two: this 
corresponds to the expected metadata overhead increase caused by 
adding a new level to the metadata tree. 
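
The relation between the number of pages and the number of tree
levels can be made explicit with a short calculation. The sketch
below assumes a binary tree whose leaves each cover exactly one page
and whose root covers the smallest power-of-two number of pages that
holds the blob. The function name tree_levels is hypothetical.

```python
def tree_levels(n_pages: int) -> int:
    """Number of levels (root and leaves included) of the tree covering n_pages."""
    if n_pages < 1:
        raise ValueError("the blob must contain at least one page")
    # (n_pages - 1).bit_length() equals ceil(log2(n_pages)) for n_pages >= 1.
    return (n_pages - 1).bit_length() + 1


for pages in (1, 2, 3, 4, 5, 1024, 1025):
    print(pages, tree_levels(pages))
```

Under these assumptions, a level is added as soon as the page count
grows past a power of two, and the depth stays constant until the
number of pages has doubled again.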

A second set of experiments evaluates the bandwidth performance 
when multiple readers access disjoint parts of the same blob. We
again use 175 nodes for these experiments. As in the previous case,
the version manager and the provider manager are deployed on two 
dedicated nodes, while a data provider and a metadata provider are 
co-deployed on the remaining 173 nodes. 

In a first phase, a single client appends data to the blob until the
blob grows to 64 GB. Then, we start reading the first 64 MB from
the blob with a single client. This process is repeated 100 times and
the average bandwidth is computed. Then, we increase the number of
readers to 100. The readers are deployed on nodes that already
run a data and metadata provider. They concurrently read distinct
64 MB chunks from the blob. Again, the process is repeated 100
times, and the average read bandwidth is computed. In a last step,
the number of concurrent clients is increased to 175 and the same
process is repeated, obtaining another average read bandwidth.

Note that, even though there is no conflict for accessing the same
data, the readers concurrently traverse the metadata tree, whose nodes
may be concurrently requested by multiple readers. Note also that, as
the number of pages of the blob is very large with respect to the
total number of available metadata and data providers, each physical
node may be subject to heavily concurrent requests.

The obtained results are represented in Figure 2(b), where the
average read bandwidth is represented as a function of the number
of concurrent readers, interpolated to fit the three experiments. We
observe a very good scalability: the read bandwidth drops from
60 MB/s for a single reader to 49 MB/s for 175 concurrent readers.
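
To put these two figures in perspective, the short calculation below
derives the retained fraction of bandwidth and an aggregate figure.
It assumes that the reported averages are per-reader bandwidths, which
is our interpretation. The aggregate value is therefore an estimate
derived from the reported numbers, not a measured result.

```python
def retained_fraction(single_mb_s: float, concurrent_mb_s: float) -> float:
    """Fraction of the single-reader bandwidth kept by each concurrent reader."""
    return concurrent_mb_s / single_mb_s


def aggregate_bandwidth(readers: int, per_reader_mb_s: float) -> float:
    """Total bandwidth if every reader sustains the per-reader average."""
    return readers * per_reader_mb_s


kept = retained_fraction(60.0, 49.0)
total = aggregate_bandwidth(175, 49.0)
print(f"each reader keeps {kept:.1%} of the single-reader bandwidth")
print(f"estimated aggregate: {total:.0f} MB/s")
```

Under this interpretation, multiplying the number of readers by 175
costs each reader less than a fifth of its bandwidth.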

6. CONCLUSION 

As more and more application classes and services start using the 
P2P paradigm in order to achieve high scalability, the demand for 
adequate, scalable data management strategies is ever higher. One 
important requirement in this context is the ability to efficiently 
cope with accesses to continuously growing data, while support- 
ing a highly concurrent, highly distributed environment. We ad- 
dress this requirement for the case of huge unstructured data. We 
propose an efficient versioning scheme allowing a large number of 
clients to concurrently read, write and append data to binary large 
objects (blobs) that are fragmented and distributed at a very large 
scale. Our algorithms guarantee atomicity while still achieving a
good data access performance. To favor scalability under heavy 
concurrency, we rely on an original metadata scheme, based on 
a distributed segment tree that we build on top of a Distributed 
Hash Table (DHT). Rather than modifying data and metadata in 
place when data updates are requested, new data fragments are cre- 
ated, and the corresponding metadata are "weaved" with the old 
metadata, in order to provide a new view of the whole blob, in a 
space-efficient way. This approach favors independent, concurrent 
accesses to data and metadata without synchronization and thereby 
enables a high throughput under heavy concurrency. The proposed 
algorithms have been implemented within our BlobSeer prototype 
and experimented on the Grid'5000 testbed, using up to 175 nodes: 
the preliminary results suggest a good scalability with respect to the 
data size and to the number of concurrent accesses. Further
experiments are in progress, which aim at demonstrating the benefits
of data and metadata distribution. We also intend to investigate
extensions to our approach that add support for volatility and
failures.